Throughout the corona virus pandemic, the various measures taken by regional legislatures to “reduce the spread” were often a subject of scrutiny. In this paper, we present an ANCOVA-coded regression model, in order to determine whether there were regional differences in the number of new corona virus cases, and to determine whether various government intervention factors were effective in reducing the number of corona virus cases. The final model yielded an \(R^2_{adj}\) value of \(46.39\%\), and demonstrated that there were regional differences in the number of corona virus cases relative to population. Additionally, the various government intervention factors had a significant effect on the number of corona virus cases. In general, as the severity of the restrictions increased, the number of new corona virus cases was shown to have increased. Finally, we conclude that causal inference is unfeasible for these data, due to the violations present in the SUTVA assumptions.
For the past two years, the corona virus pandemic has been a subject of vexation. A specific concern that has arisen throughout the pandemic, is whether or not certain geographical areas, are more prone to outbreaks than others, and whether government interventions to reduce corona virus infections were effective. In this paper, we shall utilize Multiple Linear Regression to determine whether or not certain geographical regions were more prone to corona virus outbreaks than others, and whether government efforts to reduce corona virus infections were effective. We then continue to explore the potential differences in the regions and government intervention effects, using techniques such as likelihood ratio tests. Finally, we discuss whether or not causal inference can be made directly on the factors of interest.
The data of interest, was obtained from the World Health Organization (WHO), and the COVID19 Data Hub. In order to examine whether government intervention (effectively) reduced the number of corona virus cases, we selected four variables representing various forms of government interventions/restrictions. Of interest to us in these variables, are their effects on the number of new cases. Presented below, are these indicator variables, along with a brief description on its contents, and coding:
facial_coverings: An indicator variable pertaining to the severity of the facial-covering requirements, for the country of the observation.
Variable Coding:
0 - No policy
1 - Recommended
2 - Required in some public spaces outside the home with other people present, or when social distancing is not possible.
3 - Required in all public spaces, with people present.
4 - Required in all public spaces, regardless if people are present or not.
school_closing: An indicator variable pertaining to the severity of the (temporary) shut down of in-person schools.
Variable Coding:
0 - No restrictions.
1 - Recommended.
2 - Required for some grade levels.
3 - Required for all grade levels.
gatherings_restrictions: An indicator variable pertaining to the restrictions imposed on public gatherings, for the observation’s country.
Variable Coding:
0 - No restrictions
1 - Restrictions for very large gatherings.
2 - Restrictions for large gatherings.
3 - Restrictions for medium-sized gatherings.
4 - Restrictions for small gatherings.
cancel_events: An indicator variable pertaining to the restrictions imposed on local events.
Variable Coding:
0 - No restrictions.
1 - Recommended to cancel the event.
2 - Required to cancel the event.
These specific variables were selected among other government restriction indicators, as they were among the most controversial restrictions taken up by various governments. Additionally, the effectiveness of these measures were under constant scrutiny. Next, we utilized the following variables from the WHO’s COVID19 - Data bank:
WHO_region: The region of the observation’s occurrence.
Regions:
Africa
Americas
Europe
South-East Asia
Western Pacific
New_cases: The number of newly reported cases.
New_deaths: The number of newly reported deaths resulting from corona virus, in the past seven days.
In these data, it is worth noting that there were negative-valued observations in the New_cases variable. The reason for this obscurity, is that corrections were submitted into the data, to account for false positive test results. Additionally, there was coding present within the indicator variables, that were coded with a negative sign. These negative codes indicate that it represents a best guess of the policy enforced in the observation’s country. For example: a \(-1\) in the cancel_events column would imply that the observation’s country likely recommended the cancellation of events.
For the purposes of this paper, we produce an “ANCOVA-styled” multiple regression model, treating the number of New_cases as our response variable, with the WHO_region , facial_coverings, school_closing, gatherings_restrictions, and cancel_events as factors, and the New_deaths variable as a covariate. In order to ensure the validity of our results, we will exclude any negative-coded indicator variables present within the data. Additionally, we will remove any correction in the New_cases variable, in order to create a more consistent trend.
We argue that it is more efficient to compare each region rather than country, as:
Countries within the same region are likely to experience more international travel from its neighboring countries, and could lead to spatial dependencies in the number of corona virus cases.
Any model comparing the different regions will have less parameters than a model comparing each individual country, and hence, will result in a lower error variance.
Countries within each region are culturally similar; wearing face masks is a common occurrence in the East Asian countries, but is not so in the American countries.
For the purposes of model building, if we compare by regions instead of by countries, we reduce the error variability – leading to increased power for tests of significance.
To begin, we first conduct some preliminary analysis of the dataset. One aspect of the data that is important to consider, is the number of missing values present. In order to evaluate how many missing values that we may be dealing with, we will utilize the gg_miss_var function, from the naniar package.
Figure 1: Missing Values by Variable
Presented from Figure 1, is a missing value plot, by variable for our dataset. As evident from the plot above, there are no missing values present within the dataset, after filtering out the negative-valued indicator observations. This will make the process of model building and data analysis easier.
Next, it may be useful to analyze the number of new corona virus cases by region, in order to gain an understanding of how the distribution of new cases differ by region.
| Region | New Cases |
|---|---|
| Africa | 7420486 |
| Americas | 12966056 |
| Eastern Mediterranean | 10524252 |
| Europe | 98533412 |
| South-East Asia | 3166716 |
| Western Pacific | 10143842 |
Table 1 shows the number of new cases per region (as designated by the World Health Organization). However, while it would be easy to conclude that Europe experienced the most corona virus cases, this summary does not take into account the population residing in each region. To see this, consider a numerical summary, as presented below:
## WHO_region New_cases Cumulative_cases
## Africa :28740 Min. : 0 Min. : 0
## Americas :17727 1st Qu.: 0 1st Qu.: 926
## Eastern Mediterranean:13510 Median : 41 Median : 14487
## Europe :30305 Mean : 1388 Mean : 262804
## South-East Asia : 3013 3rd Qu.: 498 3rd Qu.: 141700
## Western Pacific : 9557 Max. :500563 Max. :21127104
## New_deaths Cumulative_deaths
## Min. : -43.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 11.0
## Median : 0.0 Median : 217.5
## Mean : 15.6 Mean : 4652.1
## 3rd Qu.: 6.0 3rd Qu.: 2379.0
## Max. :8786.0 Max. :196818.0
As evident from the summary function, we can see that the majority of the observations in this dataset were from Europe. The region with the least amount of observations in this dataset, was South East Asia. This unbalance in the number of observations per region is to be expected, as we are dealing with observational data. Next, one may notice that the summaries for the New_cases and New_deaths variables have negative values as their minimum. As stated earlier, the reason behind these negative numbers, is that countries made corrections on the number of corona virus cases on previous days. We can see that on average, there were \(1,383\) New cases for a given observation in this dataset. However, the median number of new cases per observation, was only \(40\). This implies that the distribution of new cases was heavily skewed right. The most probable reason for this scenario, is due to differences in population between the countries. Large countries such as the United States, or, China, are bound to have a higher population, and hence, a (potentially) higher number of new cases. Therefore, in order to have a more interpretable response variable, we shall standardize each region’s number of new cases by their respective populations.
Another potential issue in our response variable, is the number of \(0\) values present. For reference to the number 0-valued response observations, consider Figure 2 below.
Figure 2: Dotplot of Response Variable Values
As evident from Figure 2 , \(0\) appears to be the mode of the New_cases variable. This suggests that zero-inflation may pose a potential issue for modeling the number of new corona virus cases.
To re-scale the data, we will utilize the population for each country in a region (as drawn from The World Data Bank1), and divide each country’s number of New_cases in a particular region, by the summed population of each country in the region. For the sake of interpretability, we will also multiply these newly-scaled data by a factor of \(100,000\). Finally, to make any future transformations on the data easier, we will add to each observation a value of 1, ton ensure that each of the scaled observations will be non-negative.
\[ Y_{ij}^{(1)} = \frac{Y_{ij}}{W_J}*100,000 + 5 \\ i \cong \text{Who Region, } j=\{1, 2, ... n_i\}, W_i \cong \text{Population of i'th Who Region} \]
| Region | New Cases (Scaled) |
|---|---|
| Africa | 144443.32 |
| Americas | 89948.93 |
| Eastern Mediterranean | 69688.79 |
| Europe | 164241.80 |
| South-East Asia | 15223.64 |
| Western Pacific | 48335.14 |
As evident from Table 2, we can see that there are obvious differences in the (scaled) number of new cases between the regions. For instance — Europe and Africa have a much higher (scaled) number of new cases, than the other regions. A comparison in the (scaled) number of corona virus cases by region will be conducted via a statistical approach in the results section.
We will now proceed to determine the distribution of mask mandates, by WHO_region, as presented in the waffle charts below.
Figure 3: Waffle Chart for Proportions of Mask Mandates by Region
As evident from Figure 3, the differing regions had differences in their respective mask requirements. For Africa, it was most common for countries to require masks in all public spaces. For Europe, most countries required masks in public areas to an extent. For the Americas, most countries required masks in public areas. For the West Pacific, most countries either had no policy. Finally, the majority of South-East Asia required masks in some public areas.
Next, we shall analyze the distribution of our response variable: the (scaled) number of new cases, using a violin plot.
Figure 4: Violin Plot of (Scaled) New Cases by Region
As evident from Figure 4, it appears that there are minor differences in the number of new cases, between the different regions. This suggests that it would be worth including Who_region as a factor in our regression model. Additionally, it is evident that these distributions are extremely right skewed, with multiple, notable, outliers.
Next, we shall analyze the case mortality rate, by region. According to the Center for Disease Control (CDC), the case mortality for a given region can be defined as the number of deaths caused by the disease, divided by the total number of cases. Presented in Table 3 below, is the case mortality rate for each WHO-designated region. These rates were calculated by taking the average of the ratio of cumulative deaths and cumulative cases, within each region.
As evident from Table 3, most of the case-mortality rates appeared to be around 1%. However, the Eastern Mediterranean region had a case-mortality rate of around 3% – around double that of the other regions. Additionally, we can see that the Western Pacific had a much lower mortality rate than the other WHO regions. A possible reason for these differences in the case mortality rates, are that different regions may not have as many testing resources as others; Regions that contain more impoverished countries, may not have the same testing capabilities as regions with wealthier countries.
Next, we shall analyze how the number of cases has changed through time, using a time-series plot. Presented in Figure 5 , is a LOESS-smoothed plot, of the number of new corona virus cases, throughout the pandemic. Each separate line, represents the number of cases for a different region.
Figure 5: LOESS Smoothed Number of Cases over Time
As evident from the LOESS-smoothed time series plot, we can see that in the beginning of the corona virus pandemic, most WHO designated regions experienced similar patterns in the number of corona virus cases. However, starting around 2021, the number of corona virus cases appeared to have change relative to their respective regions of occurrence. For instance: in 2021, Europe showed a significant increase in the number of corona virus cases. More recently, we can see that the Western-Pacific region has experienced a dramatic increase in the number of corona virus cases.
Next, we will view the distribution of the number of new cases, with respect to the different facial covering restrictions.
Figure 6: Violin Plot of (Scaled) New Cases by Facial Covering Restrictions
As presented in the Figure 6, there appears to be differences in the distributions of new cases by facial covering restrictions. Additionally, it appears that the variance between the distributions differ significantly – which may present a potential issue when fitting our model. Again, it appears that these conditional distributions are severely right-skewed. Next, we will analyze the distribution of the number of new cases with respect to the different gathering restrictions.
Figure 7: Violin Plot of (Scaled) New Cases by Gathering Restrictions
As evident from Figure 7, these conditional distributions do not appear to share a similar variance. This may as well, pose a potential issue when fitting our model. Additionally, there does not appear to be much of a difference between the number of new cases, with respect to the gathering restrictions.
Figure 8: Violin Plot of (Scaled) New Cases by School Closure Levels
As evident from Figure 8, the conditional distributions, again, appear to have different variances. Additionally, there do not appear to be any significant differences between the different school closure restrictions. Next, we shall analyze the distribution of the new corona virus cases with respect to the varying event restrictions.
Figure 9: Violin Plot of (Scaled) New Cases by Event Cancellation Severity
As evident from the violin plots presented in Figure 9, these conditional distributions, again, appear to differ in variance. This suggests that a transformation may be needed to coerce the response variable into the homoskedasticity assumption.
Finally, we analyze the relationship between the (scaled) number of new_deaths, with the (scaled) number of new cases.
Figure 10: Scatterplot of New Deaths vs. New Cases
Contrary to our intuition, there does not appear to be a clear, linear relationship between the (scaled) number of new deaths, and the (scaled) number of new cases. As a result, we shall attempt to fit a second-order term in our model-building process, in an effort to capture the relationship between New_deaths_scaled and New_cases_scaled. With this background information of the data in mind, we shall begin to produce our preliminary fit.
The purpose of this preliminary fit, is to gain inference on whether the variables selected (see Background) were significant in determining the number of new corona viruses cases. We now fit a simple, multiple regression model with all of the factors of interest, using the lm function from base.
\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + \beta_4X_{4i} + \beta_5X_{5i} + \beta_6X_{6i} +\epsilon_i \\ i = 1,2,...101,893, \text{ Where } \epsilon \stackrel{i.i.d.}{\sim} Normal(0, \sigma^2), \text{ and:} \\ X_1 = \text{WHO_region, } X_2 = \text{gathering_restrictions, } \\ X_3 = \text{school_closing, } X_4 = \text{cancel_events, }X_5 = \text{facial_coverings, and } X_6 = \text{New_deaths_scaled} \]
##
## Call:
## lm(formula = New_cases_scaled ~ WHO_region + factor(gatherings_restrictions) +
## factor(school_closing) + factor(cancel_events) + facial_coverings +
## New_deaths_scaled, data = covid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.846 -0.183 -0.047 0.053 61.676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.910e+02 2.011e+00 -94.965 < 2e-16
## WHO_regionAmericas 5.867e-02 1.043e-02 5.626 1.85e-08
## WHO_regionEastern Mediterranean 1.090e-01 1.136e-02 9.598 < 2e-16
## WHO_regionEurope 3.269e-01 9.745e-03 33.542 < 2e-16
## WHO_regionSouth-East Asia 7.340e-02 2.034e-02 3.609 0.000308
## WHO_regionWestern Pacific 1.039e-01 1.291e-02 8.043 8.87e-16
## factor(gatherings_restrictions)1 2.003e-01 2.199e-02 9.109 < 2e-16
## factor(gatherings_restrictions)2 -3.505e-02 1.538e-02 -2.279 0.022652
## factor(gatherings_restrictions)3 -1.611e-01 1.371e-02 -11.747 < 2e-16
## factor(gatherings_restrictions)4 -2.701e-01 1.449e-02 -18.639 < 2e-16
## factor(school_closing)1 -6.019e-02 1.036e-02 -5.808 6.35e-09
## factor(school_closing)2 -1.136e-01 1.213e-02 -9.366 < 2e-16
## factor(school_closing)3 -1.762e-01 1.157e-02 -15.235 < 2e-16
## factor(cancel_events)1 1.038e-01 1.352e-02 7.677 1.65e-14
## factor(cancel_events)2 2.710e-01 1.425e-02 19.021 < 2e-16
## facial_coveringsRecommended 5.820e-02 1.564e-02 3.722 0.000198
## facial_coveringsRequired Partially 7.332e-02 1.197e-02 6.127 8.99e-10
## facial_coveringsRequired in Public 1.840e-01 1.049e-02 17.545 < 2e-16
## facial_coveringsRequired Out of House 2.255e-01 1.205e-02 18.705 < 2e-16
## New_deaths_scaled 3.918e+01 4.024e-01 97.368 < 2e-16
##
## (Intercept) ***
## WHO_regionAmericas ***
## WHO_regionEastern Mediterranean ***
## WHO_regionEurope ***
## WHO_regionSouth-East Asia ***
## WHO_regionWestern Pacific ***
## factor(gatherings_restrictions)1 ***
## factor(gatherings_restrictions)2 *
## factor(gatherings_restrictions)3 ***
## factor(gatherings_restrictions)4 ***
## factor(school_closing)1 ***
## factor(school_closing)2 ***
## factor(school_closing)3 ***
## factor(cancel_events)1 ***
## factor(cancel_events)2 ***
## facial_coveringsRecommended ***
## facial_coveringsRequired Partially ***
## facial_coveringsRequired in Public ***
## facial_coveringsRequired Out of House ***
## New_deaths_scaled ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.041 on 102832 degrees of freedom
## Multiple R-squared: 0.1217, Adjusted R-squared: 0.1216
## F-statistic: 750 on 19 and 102832 DF, p-value: < 2.2e-16
From the summary function output, we can see that every term in the model was significantly different from 0. However, the value of the \(R^2_{adj}\) is very low, suggesting that not much variability was explained by the model. To diagnose the potential issues, we shall utilize a residual a QQ-plot, and the Cook’s Distance Criteria. We set the threshold, that if a residual obtains a Cook’s distance of 0.1, or greater, then it will be considered an influential point.
Figure 11: Residual Plots for Preliminary Fit
As evident from the residual plot in Figure 11, the homoskedasticity assumption of the regression, appears to be violated. A fan effect is demonstrated in the residual plot, along with a few notable outliers. There also appears to be a slight curvature to the residuals. This suggests that a transformation may be needed to coerce the response variable into meeting the necessary assumptions. Additionally, we can see from the normality plot that the distribution is very right skewed. Next, we can see that there were a few outliers such as observation \(27427\)2, which had a Cook’s Distance of around 8 — indicating it to be a highly influential point. Before we move onto our next fit, we shall utilize the Box-Cox method, in order to find a suitable transformation for the data. Additionally, we shall remove all influential points – points with a Cook’s distance greater than 0.1.
Figure 12: Box-Cox Transformation Plot
According to the Box-Cox transformation method, the best suited transformation of the response variable is \(Y^{-2}\). Next, we will implement the transformation on our model, with the high influential points removed.
For our next model, we will take the advice of the Box-Cox procedure, and transform our response variable accordingly. Additionally, we shall include a second-order term for the New_deaths_scaled variable. We anticipate that the second-order term will be able to capture the curvature presented in Figure 10, earlier. Additionally, we will remove all the observations that had a Cook’s Distance measurement of \(0.1\) or greater.
\[ (Y_i)^{-2} = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + \beta_4X_{4i} \\ + \beta_5X_{5i} + \beta_6X_{6i} + \beta_7X_{7i} + \epsilon_i \\ i = 1,2,...101,893, \text{ Where } \epsilon \stackrel{i.i.d.}{\sim} Normal(0, \sigma^2), \text{ and:} \\ X_1 = \text{WHO_region, } X_2 = \text{gathering_restrictions, } \\ X_3 = \text{school_closing, } X_4 = \text{cancel_events, }X_5 = \text{facial_coverings, }\\ X_6 = \text{New_deaths_scaled, and } X_7 = \text{New_deaths_scaled}^2 \]
##
## Call:
## lm(formula = (New_cases_scaled)^(-2) ~ WHO_region + factor(gatherings_restrictions) +
## factor(school_closing) + factor(cancel_events) + facial_coverings +
## New_deaths_scaled + I(New_deaths_scaled^2), data = covid_outlier_free)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.090270 -0.000323 0.000343 0.001055 0.023833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.125e+01 5.874e-01 121.297 < 2e-16
## WHO_regionAmericas -3.102e-04 3.150e-05 -9.849 < 2e-16
## WHO_regionEastern Mediterranean -1.047e-03 3.442e-05 -30.413 < 2e-16
## WHO_regionEurope -1.806e-03 2.966e-05 -60.891 < 2e-16
## WHO_regionSouth-East Asia -5.613e-04 6.140e-05 -9.142 < 2e-16
## WHO_regionWestern Pacific -5.234e-04 3.898e-05 -13.428 < 2e-16
## factor(gatherings_restrictions)1 -4.074e-04 6.638e-05 -6.137 8.42e-10
## factor(gatherings_restrictions)2 3.459e-04 4.643e-05 7.449 9.49e-14
## factor(gatherings_restrictions)3 5.342e-04 4.140e-05 12.903 < 2e-16
## factor(gatherings_restrictions)4 8.848e-04 4.375e-05 20.224 < 2e-16
## factor(school_closing)1 3.898e-04 3.128e-05 12.459 < 2e-16
## factor(school_closing)2 8.269e-04 3.662e-05 22.584 < 2e-16
## factor(school_closing)3 8.774e-04 3.492e-05 25.124 < 2e-16
## factor(cancel_events)1 -5.103e-04 4.083e-05 -12.497 < 2e-16
## factor(cancel_events)2 -1.031e-03 4.304e-05 -23.947 < 2e-16
## facial_coveringsRecommended -6.322e-04 4.721e-05 -13.392 < 2e-16
## facial_coveringsRequired Partially -1.008e-03 3.613e-05 -27.904 < 2e-16
## facial_coveringsRequired in Public -1.144e-03 3.175e-05 -36.019 < 2e-16
## facial_coveringsRequired Out of House -1.254e-03 3.655e-05 -34.295 < 2e-16
## New_deaths_scaled -2.797e+01 2.330e-01 -120.049 < 2e-16
## I(New_deaths_scaled^2) 2.746e+00 2.311e-02 118.847 < 2e-16
##
## (Intercept) ***
## WHO_regionAmericas ***
## WHO_regionEastern Mediterranean ***
## WHO_regionEurope ***
## WHO_regionSouth-East Asia ***
## WHO_regionWestern Pacific ***
## factor(gatherings_restrictions)1 ***
## factor(gatherings_restrictions)2 ***
## factor(gatherings_restrictions)3 ***
## factor(gatherings_restrictions)4 ***
## factor(school_closing)1 ***
## factor(school_closing)2 ***
## factor(school_closing)3 ***
## factor(cancel_events)1 ***
## factor(cancel_events)2 ***
## facial_coveringsRecommended ***
## facial_coveringsRequired Partially ***
## facial_coveringsRequired in Public ***
## facial_coveringsRequired Out of House ***
## New_deaths_scaled ***
## I(New_deaths_scaled^2) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003143 on 102828 degrees of freedom
## Multiple R-squared: 0.452, Adjusted R-squared: 0.4519
## F-statistic: 4240 on 20 and 102828 DF, p-value: < 2.2e-16
As evident from the summary output, we can see that every term in the model is statistically significant. Additionally, we can see that the value of the \(R^2_{adj}\) has increased to .4515. With this improvement in the model, we shall now turn to residual analysis in order to ensure that our model assumptions have been met.
Figure 13: Residual Plots for Fit 1
As evident from the residual plot, there appears to be multiple outliers present. Of particular concern, is observation \(47676\). The homoskedasticity assumption appears to have been slightly improved but still violated. Additionally, we can see that the normality assumption of the data has improved, but still deviates slightly along the more extreme theoretical quantiles. Next, from the Cook’s Distance plots, we can see that there were two largely influential points: observations \(47676\)3, and \(33909\)4. In hopes of improving the homoskedasticity and the normality assumptions, we shall remove these observations, and refit the model.
##
## Call:
## lm(formula = (New_cases_scaled)^(-2) ~ WHO_region + factor(gatherings_restrictions) +
## factor(school_closing) + factor(cancel_events) + facial_coverings +
## New_deaths_scaled + I(New_deaths_scaled^2), data = covid_without_residuals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.037709 -0.000309 0.000318 0.001028 0.021691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.898e+01 7.129e-01 124.813 < 2e-16
## WHO_regionAmericas -2.844e-04 3.122e-05 -9.110 < 2e-16
## WHO_regionEastern Mediterranean -9.726e-04 3.416e-05 -28.473 < 2e-16
## WHO_regionEurope -1.733e-03 2.945e-05 -58.854 < 2e-16
## WHO_regionSouth-East Asia -5.445e-04 6.086e-05 -8.947 < 2e-16
## WHO_regionWestern Pacific -5.113e-04 3.863e-05 -13.234 < 2e-16
## factor(gatherings_restrictions)1 -3.938e-04 6.579e-05 -5.986 2.16e-09
## factor(gatherings_restrictions)2 3.300e-04 4.602e-05 7.171 7.49e-13
## factor(gatherings_restrictions)3 5.193e-04 4.104e-05 12.655 < 2e-16
## factor(gatherings_restrictions)4 8.619e-04 4.336e-05 19.876 < 2e-16
## factor(school_closing)1 3.878e-04 3.100e-05 12.507 < 2e-16
## factor(school_closing)2 8.466e-04 3.629e-05 23.327 < 2e-16
## factor(school_closing)3 8.673e-04 3.461e-05 25.058 < 2e-16
## factor(cancel_events)1 -4.891e-04 4.047e-05 -12.085 < 2e-16
## factor(cancel_events)2 -9.872e-04 4.267e-05 -23.135 < 2e-16
## facial_coveringsRecommended -6.045e-04 4.679e-05 -12.920 < 2e-16
## facial_coveringsRequired Partially -9.924e-04 3.581e-05 -27.710 < 2e-16
## facial_coveringsRequired in Public -1.094e-03 3.148e-05 -34.761 < 2e-16
## facial_coveringsRequired Out of House -1.178e-03 3.627e-05 -32.482 < 2e-16
## New_deaths_scaled -3.502e+01 2.830e-01 -123.759 < 2e-16
## I(New_deaths_scaled^2) 3.446e+00 2.808e-02 122.744 < 2e-16
##
## (Intercept) ***
## WHO_regionAmericas ***
## WHO_regionEastern Mediterranean ***
## WHO_regionEurope ***
## WHO_regionSouth-East Asia ***
## WHO_regionWestern Pacific ***
## factor(gatherings_restrictions)1 ***
## factor(gatherings_restrictions)2 ***
## factor(gatherings_restrictions)3 ***
## factor(gatherings_restrictions)4 ***
## factor(school_closing)1 ***
## factor(school_closing)2 ***
## factor(school_closing)3 ***
## factor(cancel_events)1 ***
## factor(cancel_events)2 ***
## facial_coveringsRecommended ***
## facial_coveringsRequired Partially ***
## facial_coveringsRequired in Public ***
## facial_coveringsRequired Out of House ***
## New_deaths_scaled ***
## I(New_deaths_scaled^2) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003115 on 102826 degrees of freedom
## Multiple R-squared: 0.4617, Adjusted R-squared: 0.4616
## F-statistic: 4410 on 20 and 102826 DF, p-value: < 2.2e-16
Figure 14: Residual Plots for Second Fit
As evident from the residual plots, there are still violations of the homoskedasticity and normality assumptions. Additionally, more influential points have appeared as a result of the refitting. Therefore, we shall refit the model with these influential points removed.
##
## Call:
## lm(formula = (New_cases_scaled)^(-2) ~ WHO_region + factor(gatherings_restrictions) +
## factor(school_closing) + factor(cancel_events) + facial_coverings +
## New_deaths_scaled + I(New_deaths_scaled^2), data = covid_without_residuals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.036757 -0.000305 0.000313 0.001023 0.020891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.382e+01 7.425e-01 126.353 < 2e-16
## WHO_regionAmericas -2.798e-04 3.115e-05 -8.983 < 2e-16
## WHO_regionEastern Mediterranean -9.561e-04 3.408e-05 -28.054 < 2e-16
## WHO_regionEurope -1.716e-03 2.938e-05 -58.409 < 2e-16
## WHO_regionSouth-East Asia -5.398e-04 6.070e-05 -8.892 < 2e-16
## WHO_regionWestern Pacific -5.073e-04 3.854e-05 -13.164 < 2e-16
## factor(gatherings_restrictions)1 -3.903e-04 6.562e-05 -5.948 2.73e-09
## factor(gatherings_restrictions)2 3.266e-04 4.591e-05 7.114 1.13e-12
## factor(gatherings_restrictions)3 5.155e-04 4.094e-05 12.592 < 2e-16
## factor(gatherings_restrictions)4 8.559e-04 4.326e-05 19.787 < 2e-16
## factor(school_closing)1 3.864e-04 3.093e-05 12.494 < 2e-16
## factor(school_closing)2 8.476e-04 3.620e-05 23.412 < 2e-16
## factor(school_closing)3 8.672e-04 3.453e-05 25.118 < 2e-16
## factor(cancel_events)1 -4.858e-04 4.037e-05 -12.034 < 2e-16
## factor(cancel_events)2 -9.798e-04 4.257e-05 -23.017 < 2e-16
## facial_coveringsRecommended -6.007e-04 4.668e-05 -12.870 < 2e-16
## facial_coveringsRequired Partially -9.880e-04 3.573e-05 -27.655 < 2e-16
## facial_coveringsRequired in Public -1.079e-03 3.141e-05 -34.357 < 2e-16
## facial_coveringsRequired Out of House -1.160e-03 3.619e-05 -32.057 < 2e-16
## New_deaths_scaled -3.694e+01 2.948e-01 -125.332 < 2e-16
## I(New_deaths_scaled^2) 3.638e+00 2.925e-02 124.350 < 2e-16
##
## (Intercept) ***
## WHO_regionAmericas ***
## WHO_regionEastern Mediterranean ***
## WHO_regionEurope ***
## WHO_regionSouth-East Asia ***
## WHO_regionWestern Pacific ***
## factor(gatherings_restrictions)1 ***
## factor(gatherings_restrictions)2 ***
## factor(gatherings_restrictions)3 ***
## factor(gatherings_restrictions)4 ***
## factor(school_closing)1 ***
## factor(school_closing)2 ***
## factor(school_closing)3 ***
## factor(cancel_events)1 ***
## factor(cancel_events)2 ***
## facial_coveringsRecommended ***
## facial_coveringsRequired Partially ***
## facial_coveringsRequired in Public ***
## facial_coveringsRequired Out of House ***
## New_deaths_scaled ***
## I(New_deaths_scaled^2) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003107 on 102822 degrees of freedom
## Multiple R-squared: 0.4635, Adjusted R-squared: 0.4634
## F-statistic: 4441 on 20 and 102822 DF, p-value: < 2.2e-16
As evident from summary output, we can see that the \(R^2_{adj}\) has increased to a value of 46.39%. Additionally, we can see that every term in the model was significantly different from 0. Although it may be tempting to utilize the hypothesis tests presented in the summary output, we must analyze the residual plots in order to determine whether the assumptions of the hypothesis tests have been met.
Figure 15: Residual Plots for Final Fit
As evident from the “brick-like” pattern in the residual plot, it appears that the homoskedasticity assumption may still be violated. However, when noticing how small the spread of the residuals is (within \(\stackrel{+}{-} .04\) of \(0\)), I believe that the violation in the homoskedasticity assumption is not too extreme. Therefore, I would argue that we can be lenient with this assumption. Additionally, the QQ-plot of the residuals suggests that the normality assumption of the model may have been violated. However. due to the large sample size of this dataset, I believe that we may be more lenient with these assumptions. According to Schmidt (2018), the regression model is robust with respect to departures in the normality assumption, for large sample sizes. As we have around \(100,000\) observations used in our regression model, our regression model should be resilient to the residuals’ departure from normality.
Note: The methods utilized in this section, assume that the homoskedasticity assumption has been met. As argued above, since the majority of the residuals consistently fall within a range of .04 of 0, it is argued that we can be lenient with the homoskedasticity violation.
In order to test whether government intervention had a significant effect on the number of corona virus cases, we will utilize a likelihood ratio test. As our normality assumption may be violated, we must use an alternative to the F-test. However, as we have a large sample size, we should be able to approximate a p-value, utilizing the fact the asymptotically, the LRT statistic follows a \(\chi^2\) distribution.
\[H_0: \text{At None of the Government Intervention Factors had a Significant Effect on the Number of New Cases} \\ H_1: \text{At Least one of the Government Intrvention Factos had a Significant Effect on the Number of New Cases}\]
As evident from the results of the likelihood ratio test, we can conclude that at least one of the government intervention variables had a significant effect on the number of corona virus cases.
Now that we have successfully determined that at least one of the government intervention variables had a significant effect on the number of cases, we shall now utilize the estimated coefficients of the parameters of interest, as well as their respective t-tests, in an attempt to determine what effect the government intervention had on the number of new cases.
Table 4: Estimates for Coefficients in Final Model
As evident from Table 4 we can see that each of the WHO regions differed in the number of (scaled) corona virus cases. For the gatherings_restrictions levels, we can see that as the severity of the restrictions increased, the larger the transformed response grew. As we enacted a \(Y^{(-2)}\) transformation, this implies that increasing the severity of the gatherings_restrictions (while holding all other factors constant) resulted in an increase of corona virus cases — which is very counter-intuitive. However, as we can define our original response, \(Y'\) as: \(Y' = 1/\sqrt(Y)\) , this implies that there would be more corona virus cases, with the less severe options of gatherings. The same results can be observed for the school_closing factor as well.
Based off of the estimates for the facial_coverings coefficients, we can see that requiring facial coverings partially was associated with higher corona virus cases, than the other facial covering restrictions (when holding other predictors fixed). In contrast, requiring facial coverings in public (although associated with an increase in corona virus cases) resulted in the smallest increase in corona virus cases (while holding other predictors fixed).
Based off of the estimates for the WHO_region coefficients, we can see that when an observation came from Europe, it was associated with more corona virus cases than the other WHO designated regions. Additionally, when the observation came from the Eastern Mediterranean, it would results in less corona virus cases than if the observation came from another region (while holding all other factors fixed).
In this dataset, causal inference is not feasible. The reason we are unable to conduct causal inference in this scenario, is that the Stable Unit Treatment Value Assumptions (SUTVA) have been violated. One of the most demonstrative violations of the SUTVA, lies in the facial_covering variable. Some people have utilized N95 masks as face coverings throughout the pandemic, while others have opted for less- protective options, such as Face Shields or neck-gaiters. Since this treatment is tailored to each individuals personal preferences, we cannot make causal inference. Additionally, we cannot utilize the ignorability assumption, as there are no covariates for which we can assume that the assignment of our factors are conditionally independent of the potential outcomes. Therefore, we are left to suffer the drawbacks associated with a lack of randomization.
Overall, when deciding to be lenient with the homoskedasticity assumption, we can conclude that the government intervention factors had a significant effect on the number of new corona virus cases. Additionally, we can conclude that there were, in fact, differences in the number of new corona virus cases, with respect to region. (even when taking population into account). However, we must take caution when making these conclusions; as our model seemingly violated the assumption of homoskedasticity for the error term, our results may be confounded. An additional drawback associated with this analysis, is the unequal representation of different regions in this dataset. As shown in our Exploratory Data Analysis, most of the observations came from Europe. This leaves the possibility of sampling bias within our results. However, due to an overall large sample size, containing large subsamples from each region, I believe that this analysis will be resilient to both the sampling bias, as well as the heteroskedasticity assumption present within our residuals.
Amand F. Schmidt, Chris Finan, Linear regression and the normality assumption, Journal of Clinical Epidemiology, Volume 98, 2018, Pages 146-151, ISSN 0895-4356,
Centers for Disease Control and Prevention. (2016, March 8). Case fatality ratios. Centers for Disease Control and Prevention. Retrieved March 11, 2022, from https://www.cdc.gov/foodnet/reports/data/case-fatality.html
Ritchie, H., Mathieu, E., Rodés-Guirao, L., Appel, C., Giattino, C., Ortiz-Ospina, E., Hasell, J., Macdonald, B., Beltekian, D., & Roser, M. (2020, March 5). Coronavirus (COVID-19) testing. Our World in Data. Retrieved March 11, 2022, from https://ourworldindata.org/coronavirus-testing
Population, total. Data. (n.d.). Retrieved March 11, 2022, from https://data.worldbank.org/indicator/SP.POP.TOTL
Wikimedia Foundation. (2022, February 6). Demographics of eritrea. Wikipedia. Retrieved March 11, 2022, from https://en.wikipedia.org/wiki/Demographics_of_Eritrea
World Health Organization. (n.d.). World Health Organization. Retrieved March 11, 2022, from https://covid19.who.int/info
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020.
Achim Zeileis, Torsten Hothorn (2002). Diagnostic Checking in Regression Relationships. R News 2(3), 7-10. URL https://CRAN.R-project.org/doc/Rnews/
Bob Rudis and Dave Gandy (2017). waffle: Create Waffle Chart Visualizations in R. R package version 0.7.0. https://CRAN.R-project.org/package=waffle
Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3. https://CRAN.R-project.org/package=gridExtra
Guidotti, E., Ardia, D., (2020), “COVID-19 Data Hub”, Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
Nicholas Tierney, Di Cook, Miles McBain and Colin Fay (2021). naniar: Data Structures, Summaries, and Visualisations for Missing Data. R package version 0.6.1. https://CRAN.R-project.org/package=naniar
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Most countries’ populations are based off of data from the year 2020. However, some countries such as Eritrea, haven’t conducted an official census as recently, and thus, estimates of the country’s population have been replaced with 2011 estimates.↩︎
This observation was from Ecuador, on July 21 2021. I attempted to research the cause behind this extreme observation, but there were no news articles containing information on coronavirus in Ecuador, circa July, 2021.↩︎
This observation was from Kazahkstan, on March 22, 2021.↩︎
This observation was from France, on April 4, 2020. The reason this observation had such an extreme number of new cases, was due to the initial corona virus outbreak in Europe.↩︎
Numerous models, in the family of general linear models, quantile regression, and IRWLS regression were attempted to fit this dataset. The model presented here, yielded the “best-behaved” residuals.↩︎